Randomized Robust Subspace Recovery for High Dimensional Data Matrices
This paper explores and analyzes two randomized designs for robust Principal
Component Analysis (PCA) employing low-dimensional data sketching. In one
design, a data sketch is constructed using random column sampling followed by
low dimensional embedding, while in the other, sketching is based on random
column and row sampling. Both designs are shown to bring about substantial
savings in complexity and memory requirements for robust subspace learning over
conventional approaches that use the full scale data. A characterization of the
sample and computational complexity of both designs is derived in the context
of two distinct outlier models, namely, sparse and independent outlier models.
The proposed randomized approach can provably recover the correct subspace with
computational and sample complexity that are almost independent of the size of
the data. The results of the mathematical analysis are confirmed through
numerical simulations using both synthetic and real data.
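As a rough illustration of the first design described above (random column sampling followed by a low-dimensional embedding), the sketch below builds such a data sketch in plain Python. The function name `sketch_columns_then_embed`, the Gaussian embedding matrix, and its scaling are assumptions made for illustration; the abstract does not specify which embedding is used, and the robust PCA step that would run on the sketch is not shown.

```python
import math
import random


def sketch_columns_then_embed(data, num_cols, embed_dim, seed=0):
    """Sample columns uniformly at random, then embed them in a lower dimension.

    data: matrix given as a list of rows (n_rows x n_cols).
    Returns (sampled column indices, embed_dim x num_cols sketch).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])

    # Step 1: uniform random column sampling without replacement.
    cols = sorted(rng.sample(range(n_cols), num_cols))

    # Step 2: random embedding matrix of size embed_dim x n_rows.
    # A scaled Gaussian matrix is an assumption, not taken from the abstract.
    scale = 1.0 / math.sqrt(embed_dim)
    phi = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_rows)]
           for _ in range(embed_dim)]

    # Sketch = phi times the sampled columns of the data.
    sketch = [[sum(phi[i][r] * data[r][c] for r in range(n_rows)) for c in cols]
              for i in range(embed_dim)]
    return cols, sketch
```

The sketch has `embed_dim * num_cols` entries regardless of the size of the original matrix, which is the kind of memory saving the abstract refers to.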
Data Dropout in Arbitrary Basis for Deep Network Regularization
An important problem in training deep networks with high capacity is to
ensure that the trained network works well when presented with new inputs
outside the training dataset. Dropout is an effective regularization technique
to boost the network generalization in which a random subset of the elements of
the given data and the extracted features are set to zero during the training
process. In this paper, a new randomized regularization technique in which we
withhold a random part of the data without necessarily turning off the
neurons/data-elements is proposed. In the proposed method, of which the
conventional dropout is shown to be a special case, random data dropout is
performed in an arbitrary basis, hence the designation Generalized Dropout. We
also present a framework whereby the proposed technique can be applied
efficiently to convolutional neural networks. The presented numerical
experiments demonstrate that the proposed technique yields notable performance
gain. Generalized Dropout provides new insight into the idea of dropout, shows
that different performance gains can be achieved by using different basis
matrices, and opens up a new research question as to how to choose optimal
basis matrices that achieve maximal performance gain.
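A minimal sketch of the idea of dropout in an arbitrary basis, under stated assumptions: the input is expressed in an orthonormal basis, a random subset of the coefficients is zeroed, and the result is mapped back. The function name `generalized_dropout`, the restriction to orthonormal bases, and the rescaling of surviving coefficients by `1 / keep_prob` (borrowed from common dropout practice) are assumptions and are not taken from the paper.

```python
import random


def generalized_dropout(x, basis, keep_prob, rng):
    """Drop random coefficients of x in the given orthonormal basis.

    x: list of floats.
    basis: list of orthonormal vectors, each a list of floats (assumed).
    keep_prob: probability of keeping each coefficient, in (0, 1].
    rng: a random.Random instance.
    """
    # Coefficients of x in the basis: c_i = <b_i, x>.
    coeffs = [sum(b_j * x_j for b_j, x_j in zip(b, x)) for b in basis]

    # Keep each coefficient with probability keep_prob and rescale survivors.
    kept = [c / keep_prob if rng.random() < keep_prob else 0.0 for c in coeffs]

    # Map the masked coefficients back to the original coordinates.
    return [sum(k * b[j] for k, b in zip(kept, basis)) for j in range(len(x))]
```

With the identity matrix as `basis`, the function zeroes individual data elements, which corresponds to conventional dropout. With any other orthonormal basis, part of the data is withheld without any single element necessarily being set to zero.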
Innovation Pursuit: A New Approach to Subspace Clustering
In subspace clustering, a group of data points belonging to a union of
subspaces are assigned membership to their respective subspaces. This paper
presents a new approach dubbed Innovation Pursuit (iPursuit) to the problem of
subspace clustering using a new geometrical idea whereby subspaces are
identified based on their relative novelties. We present two frameworks in
which the idea of innovation pursuit is used to distinguish the subspaces.
Underlying the first framework is an iterative method that finds the subspaces
consecutively by solving a series of simple linear optimization problems, each
searching for a direction of innovation in the span of the data potentially
orthogonal to all subspaces except for the one to be identified in one step of
the algorithm. A detailed mathematical analysis is provided establishing
sufficient conditions for iPursuit to correctly cluster the data. The proposed
approach can provably yield exact clustering even when the subspaces have
significant intersections. It is shown that the complexity of the iterative
approach scales only linearly in the number of data points and subspaces, and
quadratically in the dimension of the subspaces. The second framework
integrates iPursuit with spectral clustering to yield a new variant of
spectral-clustering-based algorithms. The numerical simulations with both real
and synthetic data demonstrate that iPursuit can often outperform the
state-of-the-art subspace clustering algorithms, more so for subspaces with
significant intersections, and that it significantly improves the
state-of-the-art result for subspace-segmentation-based face clustering.
High Dimensional Low Rank plus Sparse Matrix Decomposition
This paper is concerned with the problem of low rank plus sparse matrix
decomposition for big data. Conventional algorithms for matrix decomposition
use the entire data to extract the low-rank and sparse components, and are
based on optimization problems with complexity that scales with the dimension
of the data, which limits their scalability. Furthermore, existing randomized
approaches mostly rely on uniform random sampling, which is quite inefficient
for many real world data matrices that exhibit additional structures (e.g.
clustering). In this paper, a scalable subspace-pursuit approach that
transforms the decomposition problem to a subspace learning problem is
proposed. The decomposition is carried out using a small data sketch formed
from sampled columns/rows. Even when the data is sampled uniformly at random,
it is shown that the sufficient number of sampled columns/rows is roughly
O(r\mu), where \mu is the coherency parameter and r the rank of the low rank
component. In addition, adaptive sampling algorithms are proposed to address
the problem of column/row sampling from structured data. We provide an analysis
of the proposed method with adaptive sampling and show that adaptive sampling
makes the required number of sampled columns/rows invariant to the distribution
of the data. The proposed approach is amenable to online implementation and an
online scheme is proposed.
Comment: IEEE Transactions on Signal Processing
Spatial Random Sampling: A Structure-Preserving Data Sketching Tool
Random column sampling is not guaranteed to yield data sketches that preserve
the underlying structures of the data and may not sample sufficiently from
less-populated data clusters. Also, adaptive sampling can often provide
accurate low rank approximations, yet may fall short of producing descriptive
data sketches, especially when the cluster centers are linearly dependent.
Motivated by that, this paper introduces a novel randomized column sampling
tool dubbed Spatial Random Sampling (SRS), in which data points are sampled
based on their proximity to randomly sampled points on the unit sphere. The
most compelling feature of SRS is that the corresponding probability of
sampling from a given data cluster is proportional to the surface area the
cluster occupies on the unit sphere, independently of the size of the cluster
population. Although it is fully randomized, SRS is shown to provide
descriptive and balanced data representations. The proposed idea addresses a
pressing need in data science and holds potential to inspire many novel
approaches for analysis of big data.
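The sampling rule described above can be sketched as follows, as one plausible reading of the abstract rather than the paper's exact procedure. Each draw generates a random direction on the unit sphere and selects the data point whose normalized version is closest to it. The function name `spatial_random_sampling`, the use of the inner product as the proximity measure, and the assumption that all points are nonzero are illustrative choices.

```python
import math
import random


def spatial_random_sampling(points, num_samples, seed=0):
    """Return indices of points chosen by proximity to random directions.

    points: list of nonzero vectors (lists of floats), all of the same length.
    The returned list may contain repeated indices.
    """
    rng = random.Random(seed)
    dim = len(points[0])

    # Project every data point onto the unit sphere.
    normalized = []
    for p in points:
        norm = math.sqrt(sum(v * v for v in p))
        normalized.append([v / norm for v in p])

    chosen = []
    for _ in range(num_samples):
        # A Gaussian vector points in a uniformly distributed direction;
        # its length does not affect which point maximizes the inner product.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        best = max(range(len(points)),
                   key=lambda i: sum(a * b for a, b in zip(normalized[i], g)))
        chosen.append(best)
    return chosen
```

Because a point is selected according to where it lies on the sphere rather than how many neighbors it has, a small cluster that covers as much of the sphere as a large one would be expected to receive a comparable share of the samples.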
Subspace Clustering via Optimal Direction Search
This letter presents a new spectral-clustering-based approach to the subspace
clustering problem. Underpinning the proposed method is a convex program for
optimal direction search, which for each data point d finds an optimal
direction in the span of the data that has minimum projection on the other data
points and non-vanishing projection on d. The obtained directions are
subsequently leveraged to identify a neighborhood set for each data point. An
alternating direction method of multipliers framework is provided to
efficiently solve for the optimal directions. The proposed method is shown to
notably outperform the existing subspace clustering methods, particularly for
unwieldy scenarios involving high levels of noise and close subspaces, and
yields the state-of-the-art results for the problem of face clustering using
subspace segmentation.
Robust, Scalable, and Provable Approaches to High Dimensional Unsupervised Learning
This doctoral thesis focuses on three popular unsupervised learning problems: subspace clustering, robust PCA, and column sampling. For the subspace clustering problem, a new transformative idea is presented. The proposed approach, termed Innovation Pursuit, is a new geometrical solution to the subspace clustering problem whereby subspaces are identified based on their relative novelties. A detailed mathematical analysis is provided establishing sufficient conditions for the proposed method to correctly cluster the data points. The numerical simulations with both real and synthetic data demonstrate that Innovation Pursuit notably outperforms the state-of-the-art subspace clustering algorithms. For the robust PCA problem, we focus on both the outlier detection and the matrix decomposition problems. For the outlier detection problem, we present a new algorithm, termed Coherence Pursuit, in addition to two scalable randomized frameworks for the implementation of outlier detection algorithms. The Coherence Pursuit method is the first non-iterative robust PCA method that is provably robust to both unstructured and structured outliers. Coherence Pursuit is remarkably simple and notably outperforms the existing methods in dealing with structured outliers. In the proposed randomized designs, we leverage the low dimensional structure of the low rank component to apply the robust PCA algorithm to a random sketch of the data as opposed to the full scale data. Importantly, it is analytically shown that the presented randomized designs can make the computation or sample complexity of the low rank matrix recovery algorithm independent of the size of the data. Finally, we focus on the column sampling problem. A new sampling tool, dubbed Spatial Random Sampling, is presented which performs the random sampling in the spatial domain. The most compelling feature of Spatial Random Sampling is that it is the first unsupervised column sampling method which preserves the spatial distribution of the data.
Scalable and Robust Community Detection with Randomized Sketching
This paper explores and analyzes the unsupervised clustering of large
partially observed graphs. We propose a scalable and provable randomized
framework for clustering graphs generated from the stochastic block model. The
clustering is first applied to a sub-matrix of the graph's adjacency matrix
associated with a reduced graph sketch constructed using random sampling. Then,
the clusters of the full graph are inferred based on the clusters extracted
from the sketch using a correlation-based retrieval step. Uniform random node
sampling is shown to improve the computational complexity over clustering of
the full graph when the cluster sizes are balanced. A new random degree-based
node sampling algorithm is presented which significantly improves upon the
performance of the clustering algorithm even when clusters are unbalanced. This
algorithm improves the phase transitions for matrix-decomposition-based
clustering with regard to computational complexity and minimum cluster size,
which are shown to be nearly dimension-free in the low inter-cluster
connectivity regime. A third sampling technique is shown to improve balance by
randomly sampling nodes based on spatial distribution. We provide analysis and
numerical results using a convex clustering algorithm based on matrix
completion.
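The retrieval step described above, in which clusters of the full graph are inferred from clusters found on the sketch, might look like the sketch below. It is one plausible reading of "correlation-based retrieval", not the paper's algorithm: each node is assigned to the sketch cluster with which it has the highest average connectivity. The function name `retrieve_clusters` and the scoring rule are assumptions, and the clustering of the sketch itself is taken as given.

```python
def retrieve_clusters(adjacency, sampled_nodes, sampled_labels, num_clusters):
    """Assign every node to the sketch cluster it is most connected to.

    adjacency: symmetric 0/1 matrix as nested lists.
    sampled_nodes: indices of the nodes that form the sketch.
    sampled_labels: cluster label of each sampled node, in the same order.
    Assumes every cluster contains at least one sampled node.
    """
    # Group the sampled nodes by their cluster label.
    members = [[] for _ in range(num_clusters)]
    for node, label in zip(sampled_nodes, sampled_labels):
        members[label].append(node)

    labels = []
    for row in adjacency:
        # Average connectivity of this node to each sketch cluster.
        scores = [sum(row[m] for m in group) / len(group) for group in members]
        labels.append(max(range(num_clusters), key=lambda k: scores[k]))
    return labels
```

Only the columns of the adjacency matrix that correspond to sampled nodes are read, so the cost of this step grows with the sketch size rather than with the square of the number of nodes.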